Add ExpressionAnalyzer for pluggable expression-level statistics estimation #21122

Open
asolimando wants to merge 16 commits into apache:main from asolimando:asolimando/ndv-expression-analyzer


@asolimando
Member

@asolimando asolimando commented Mar 23, 2026

Which issue does this PR close?

Part of #21120 (framework + projection/filter integration)

Rationale for this change

DataFusion currently loses expression-level statistics when computing plan metadata. Projected expressions that aren't bare columns or literals get unknown statistics, and filter selectivity falls back to a hardcoded 20% when interval analysis cannot handle the predicate (e.g. OR predicates, which are not expressible as a single interval). There is also no extension point for users to provide statistics for their own UDFs.

This PR introduces ExpressionAnalyzer, a pluggable chain-of-responsibility framework that addresses these gaps. It follows the same extensibility pattern used elsewhere in DataFusion (ExprPlanner, OptimizerRule, StatisticsRegistry).

What changes are included in this PR?

  • ExpressionAnalyzer trait and ExpressionAnalyzerRegistry (chain-of-responsibility, first Computed wins)
  • DefaultExpressionAnalyzer with Selinger-style estimation: equality/inequality via NDV, AND/OR via inclusion-exclusion, injective arithmetic (+/-), literals, NOT
  • ProjectionExprs and FilterExec use the registry for expression-level statistics
  • Three new default trait methods on ExecutionPlan (uses_expression_level_statistics, with_expression_analyzer_registry, expression_analyzer_registry) for injection, overridden by FilterExec, ProjectionExec, AggregateExec, HashJoinExec, and SortMergeJoinExec
  • The physical planner injects the registry after plan creation and re-injects after each optimizer rule that modifies the plan, ensuring optimizer-created nodes always carry it
  • AggregateStatisticsProvider and JoinStatisticsProvider (introduced in #21483, "feat: Add pluggable StatisticsRegistry for operator-level statistics propagation") consume the registry via the trait getter
  • Config option optimizer.use_expression_analyzer (default false), zero overhead when disabled
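
The "chain-of-responsibility, first Computed wins" behavior described in the list above can be illustrated with a small, self-contained sketch. Every name here (`AnalysisResult`, `Expr`, `Analyzer`, `Registry`, `MyUdfAnalyzer`, `FallbackAnalyzer`, `build_registry`) is a hypothetical stand-in chosen for illustration; the actual trait and registry signatures in this PR may differ.

```rust
/// Outcome of asking one analyzer about an expression (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum AnalysisResult {
    /// The analyzer produced an estimate; the chain stops here.
    Computed(f64),
    /// The analyzer has nothing to say; the next analyzer is tried.
    Delegate,
}

/// Toy expression type standing in for a physical expression.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Udf(String),
}

/// A pluggable analyzer. Only selectivity is modelled in this sketch.
trait Analyzer {
    fn selectivity(&self, expr: &Expr) -> AnalysisResult;
}

/// Registry that consults analyzers in order; the first `Computed` wins.
struct Registry {
    analyzers: Vec<Box<dyn Analyzer>>,
}

impl Registry {
    fn selectivity(&self, expr: &Expr) -> Option<f64> {
        for analyzer in &self.analyzers {
            if let AnalysisResult::Computed(s) = analyzer.selectivity(expr) {
                return Some(s);
            }
        }
        // Nobody in the chain could compute an estimate.
        None
    }
}

/// User-supplied analyzer that only knows about one (made-up) UDF.
struct MyUdfAnalyzer;

impl Analyzer for MyUdfAnalyzer {
    fn selectivity(&self, expr: &Expr) -> AnalysisResult {
        match expr {
            Expr::Udf(name) if name == "is_weekend" => AnalysisResult::Computed(2.0 / 7.0),
            _ => AnalysisResult::Delegate,
        }
    }
}

/// Fallback analyzer standing in for a built-in default (values are made up).
struct FallbackAnalyzer;

impl Analyzer for FallbackAnalyzer {
    fn selectivity(&self, expr: &Expr) -> AnalysisResult {
        match expr {
            Expr::Column(_) => AnalysisResult::Computed(0.5),
            _ => AnalysisResult::Delegate,
        }
    }
}

/// User analyzers are placed first so they take precedence over the fallback.
fn build_registry() -> Registry {
    Registry {
        analyzers: vec![Box::new(MyUdfAnalyzer), Box::new(FallbackAnalyzer)],
    }
}
```

In this sketch the UDF analyzer answers for `is_weekend`, the fallback answers for plain columns, and an unknown UDF falls through the whole chain and yields `None`, which a caller would treat as "no estimate available".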

Are these changes tested?

  • 21 unit tests for ExpressionAnalyzer
  • 4 integration tests verifying registry injection survives optimizer rules
  • 5 end-to-end SLT tests (OR selectivity, TopK, UNION ALL, hash join filter pushdown)
  • Existing projection and filter test suites pass unchanged

Are there any user-facing changes?

New public API (purely additive, non-breaking):

  • ExpressionAnalyzer trait and ExpressionAnalyzerRegistry in datafusion-physical-expr
  • SessionState::expression_analyzer_registry() getter
  • SessionStateBuilder::with_expression_analyzer_registry() setter
  • Three default trait methods on ExecutionPlan: uses_expression_level_statistics(), with_expression_analyzer_registry(), expression_analyzer_registry()
  • Config option datafusion.optimizer.use_expression_analyzer

No breaking changes. Default behavior is unchanged (config defaults to false).
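
The three injection hooks are described as default trait methods that only some operators override. A minimal sketch of that general Rust pattern follows, assuming a simplified single-node setting; `PlanNode`, `Registry`, `Scan`, `Filter`, and `inject` are hypothetical stand-ins and not the signatures added to `ExecutionPlan` by this PR.

```rust
use std::sync::Arc;

/// Stand-in for the analyzer registry (hypothetical, empty here).
#[derive(Debug, Default)]
struct Registry;

/// Stand-in for a plan node trait with opt-in registry injection.
trait PlanNode {
    fn name(&self) -> &str;

    /// Nodes that do not use expression-level statistics keep this default.
    fn uses_expression_level_statistics(&self) -> bool {
        false
    }

    /// Returns a copy carrying the registry, or `None` if unsupported.
    fn with_registry(&self, _registry: Arc<Registry>) -> Option<Arc<dyn PlanNode>> {
        None
    }

    /// Returns the registry currently attached to this node, if any.
    fn registry(&self) -> Option<&Arc<Registry>> {
        None
    }
}

/// A node that relies entirely on the defaults.
struct Scan;

impl PlanNode for Scan {
    fn name(&self) -> &str {
        "Scan"
    }
}

/// A node that opts in by overriding all three hooks.
struct Filter {
    registry: Option<Arc<Registry>>,
}

impl PlanNode for Filter {
    fn name(&self) -> &str {
        "Filter"
    }

    fn uses_expression_level_statistics(&self) -> bool {
        true
    }

    fn with_registry(&self, registry: Arc<Registry>) -> Option<Arc<dyn PlanNode>> {
        Some(Arc::new(Filter { registry: Some(registry) }))
    }

    fn registry(&self) -> Option<&Arc<Registry>> {
        self.registry.as_ref()
    }
}

/// Injects the registry into a node that opts in and does not carry one yet.
/// A real planner would walk the whole plan tree; this handles one node only.
fn inject(node: Arc<dyn PlanNode>, registry: &Arc<Registry>) -> Arc<dyn PlanNode> {
    if node.uses_expression_level_statistics() && node.registry().is_none() {
        if let Some(updated) = node.with_registry(Arc::clone(registry)) {
            return updated;
        }
    }
    node
}
```

Because the defaults are no-ops, existing implementors compile unchanged, which is consistent with the "purely additive" claim; the cost is that the hooks become part of the trait's public surface.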


Disclaimer: I used AI to assist in the code generation; I have manually reviewed the output, and it matches my intention and understanding.

@github-actions github-actions Bot added the physical-expr, core, common, physical-plan, documentation, and sqllogictest labels Mar 23, 2026
@asolimando asolimando force-pushed the asolimando/ndv-expression-analyzer branch from 322b97f to f101c51 Compare March 25, 2026 12:05
@asolimando asolimando force-pushed the asolimando/ndv-expression-analyzer branch 2 times, most recently from dfa1324 to f6f27ac Compare April 1, 2026 17:18
@asolimando asolimando marked this pull request as ready for review April 1, 2026 17:18
@asolimando
Member Author

@2010YOUY01: FYI I took a final pass on the PR and marked it as "reviewable"

Contributor

@kosiew kosiew left a comment


@asolimando

Thanks for the solid work here. The new analyzer direction looks promising, but I ran into a couple of issues that could affect correctness and usefulness of the stats. Left a few detailed comments below.

Comment thread datafusion/physical-expr/src/projection.rs
Comment thread datafusion/physical-expr/src/expression_analyzer/default.rs Outdated
Comment thread datafusion/physical-expr/src/expression_analyzer/default.rs Outdated
@asolimando
Member Author

> @asolimando
>
> Thanks for the solid work here. The new analyzer direction looks promising, but I ran into a couple of issues that could affect correctness and usefulness of the stats. Left a few detailed comments below.

Thanks to you @kosiew for the spot-on review, and for sharing your feedback. I have addressed the requested changes, happy to iterate further if needed!

@asolimando asolimando requested a review from kosiew April 2, 2026 17:38
Contributor

@kosiew kosiew left a comment


@asolimando

Thanks for the follow-up here. The issues called out in the earlier review look addressed, including the projection ordering fix, the equality and inequality selectivity registry lookup, and narrowing the injective arithmetic NDV rule. I did notice one remaining gap around registry propagation through planner and optimizer-created projections.

Comment thread datafusion/core/src/physical_planner.rs Outdated
Contributor

@kosiew kosiew left a comment


@asolimando, I spotted an issue with NDV handling that can affect selectivity estimates depending on operand order.

Comment thread datafusion/physical-expr/src/expression_analyzer/default.rs Outdated
github-merge-queue Bot pushed a commit that referenced this pull request Apr 13, 2026
…propagation (#21483)

## Which issue does this PR close?

- Part of #21443 (Pluggable operator-level statistics propagation)
- Part of #8227 (statistics improvements epic)

## Rationale for this change

DataFusion's built-in statistics propagation has no extension point:
downstream projects cannot inject external catalog stats, override
built-in estimation, or plug in custom strategies without forking.

This PR introduces `StatisticsRegistry`, a pluggable
chain-of-responsibility for operator-level statistics following the same
pattern as `RelationPlanner` for SQL parsing and `ExpressionAnalyzer`
(#21120) for expression-level stats. See #21443 for full motivation and
design context.

## What changes are included in this PR?

1. Framework (`operator_statistics/mod.rs`): `StatisticsProvider` trait,
`StatisticsRegistry` (chain-of-responsibility), `ExtendedStatistics`
(Statistics + type-erased extension map), `DefaultStatisticsProvider`.
`PhysicalOptimizerContext` trait with `optimize_with_context` dispatch.
`SessionState` integration.

2. Built-in providers for Filter, Projection, Passthrough
(sort/repartition/etc), Aggregate, Join
(hash/sort-merge/nested-loop/cross), Limit, and Union. NDV utilities:
`num_distinct_vals`, `ndv_after_selectivity`.

3. `ClosureStatisticsProvider`: closure-based provider for test
injection and cardinality feedback.

4. JoinSelection integration: `use_statistics_registry` config flag
(default false), registry-aware `optimize_with_context`, SLT test
demonstrating plan difference on skewed data.

## Are these changes tested?

- 39 unit tests covering all providers, NDV utilities, chain priority,
and edge cases (Inexact precision, Absent propagation, Partial aggregate
delegation, GROUPING SETS delegation, join-type bounds, multi-key NDV,
exact Cartesian product, CrossJoin, GlobalLimit skip+fetch)
- 1 SLT test (`statistics_registry.slt`): three-table join on skewed
data (8:1:1 customer_id distribution) where the built-in NDV formula
estimates 33 rows (wrong; actual=66) and the registry conservatively
estimates 100, producing the correct build-side swap

## Are there any user-facing changes?

New public API (purely additive, non-breaking):
- `StatisticsProvider` trait and `StatisticsRegistry` in
`datafusion-physical-plan`
- `ExtendedStatistics`, `StatisticsResult` types; built-in provider
structs; `num_distinct_vals`, `ndv_after_selectivity` utilities
- `PhysicalOptimizerContext` trait and `ConfigOnlyContext` in
`datafusion-physical-optimizer`
- `SessionState::statistics_registry()`,
`SessionStateBuilder::with_statistics_registry()`
- Config: `datafusion.optimizer.use_statistics_registry` (default false)

Default behavior is unchanged. The registry is only consulted when the
flag is explicitly enabled.

Known limitations:
- Column-level stats (NDV, min/max) at Join/Aggregate/Union/Limit
boundaries are not improved: these operators call
`partition_statistics(None)` internally, re-fetching raw child stats and
discarding registry enrichment. 4 TODO comments mark the affected call
sites; #20184 would close this gap.
- No `ExpressionAnalyzer` integration yet (#21122).

---
Disclaimer: I used AI to assist in the code generation, I have manually
reviewed the output and it matches my intention and understanding.
coderfender pushed a commit to coderfender/datafusion that referenced this pull request Apr 14, 2026
…propagation (apache#21483)

@asolimando asolimando force-pushed the asolimando/ndv-expression-analyzer branch from 54a0df7 to ae2a0b8 Compare April 15, 2026 17:43
@github-actions github-actions Bot added the optimizer label Apr 15, 2026
@asolimando asolimando force-pushed the asolimando/ndv-expression-analyzer branch 2 times, most recently from c37838b to 42d0f8e Compare April 15, 2026 19:41
@asolimando asolimando requested a review from kosiew April 15, 2026 20:15
@asolimando
Member Author

asolimando commented Apr 15, 2026

@kosiew the CI error seems unrelated: the same test suite passed on macOS but not on amd64. I can't reproduce it, and it doesn't seem that it could be caused by changes in this PR. It feels like a flaky test, but I'd like to get your opinion.

EDIT: confirmed it was a flaky test, fixed by #21657. I am force-pushing to resolve conflicts with main; there are no real changes w.r.t. the status update I gave above, just a mechanical fix due to the as_any removal.

@asolimando asolimando force-pushed the asolimando/ndv-expression-analyzer branch from b6d2621 to 189287f Compare April 16, 2026 09:50
Contributor

@kosiew kosiew left a comment


@asolimando,

Thanks for the updates here, a lot of the earlier concerns have been addressed nicely. The projection ordering fixes and the tighter NDV handling look solid. I took another pass and things are generally in good shape, but there are still a couple of edge cases and behavioral inconsistencies worth tightening up before landing.

Comment thread datafusion/core/src/physical_planner.rs
Comment thread datafusion/physical-plan/src/operator_statistics/mod.rs Outdated
@asolimando asolimando requested a review from kosiew April 16, 2026 12:16
Introduce ExpressionAnalyzer, a chain-of-responsibility framework for
expression-level statistics estimation (NDV, selectivity, min/max).

Framework:
- ExpressionAnalyzer trait with registry parameter for chain delegation
- ExpressionAnalyzerRegistry to chain analyzers (first Computed wins)
- DefaultExpressionAnalyzer: Selinger-style estimation for columns,
  literals, binary expressions, NOT, boolean predicates

Integration:
- ExpressionAnalyzerRegistry stored in SessionState, initialized once
- ProjectionExprs stores optional registry (non-breaking, no signature
  changes to project_statistics)
- ProjectionExec sets registry via Projector, injected by planner
- FilterExec uses registry for selectivity when interval analysis
  cannot handle the predicate
- Custom nodes get builtin analyzer as fallback when registry is absent
- Regenerate configs.md for new enable_expression_analyzer option
- Add enable_expression_analyzer to information_schema.slt expected output
- Fix unresolved doc links to SessionState and DefaultExpressionAnalyzer
  (cross-crate references use backticks instead of doc links)
- Simplify config description
…putation

- Fix expression_analyzer_registry doc comment misplaced between
  function_factory's doc comment and field declaration
- Fix module doc example import path (physical_plan -> physical_expr)
- Extract expression_analyzer_registry() helper in planner to avoid
  repeating the config check 4 times
- Defer left_sel/right_sel computation to AND/OR arms only, avoiding
  unnecessary sub-expression selectivity estimation for comparison
  operators
…ptimizer loop

Add trait methods on ExecutionPlan for expression-level statistics injection
(uses_expression_level_statistics, with_expression_analyzer_registry,
expression_analyzer_registry). The physical planner injects the registry
after plan creation and re-injects after each optimizer rule that modifies
the plan, gated by the use_expression_analyzer config flag.
…ty for OR predicates

OR predicates are inherently outside interval arithmetic (a union of two
disjoint intervals cannot be represented as a single interval). This test
confirms that ExpressionAnalyzerRegistry computes the correct
inclusion-exclusion selectivity (0.28 = 0.1 + 0.2 - 0.02) on a 1000-row
input, versus the default 20% (200 rows) without a registry.
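
The arithmetic in this commit message can be checked with a short sketch. It assumes the two OR branches are independent (the standard assumption behind the inclusion-exclusion formula) and rounds the row estimate to the nearest integer, which is an illustrative choice rather than DataFusion's documented behavior; the function names are hypothetical.

```rust
/// Selectivity of `a AND b`, assuming the two predicates are independent.
fn and_selectivity(a: f64, b: f64) -> f64 {
    a * b
}

/// Selectivity of `a OR b` via inclusion-exclusion:
/// P(a or b) = P(a) + P(b) - P(a and b).
fn or_selectivity(a: f64, b: f64) -> f64 {
    a + b - and_selectivity(a, b)
}

/// Estimated output rows for a filter with the given selectivity.
/// Rounding to nearest is an assumption made for this sketch.
fn estimated_rows(input_rows: u64, selectivity: f64) -> u64 {
    (input_rows as f64 * selectivity).round() as u64
}
```

With branch selectivities 0.1 and 0.2, `or_selectivity` gives approximately 0.28 (0.1 + 0.2 - 0.02), so a 1000-row input is estimated at about 280 rows, compared with 200 rows under the fixed 20% fallback.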
…ailable

Return Delegate for all leaf predicates when NDV is unavailable, and
propagate Delegate upward through AND/OR/NOT when any child has no
estimate. DefaultExpressionAnalyzer now only produces a result when it
has a genuine information advantage (NDV from column statistics).
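
The Delegate propagation rule described here can be sketched as a recursive estimator over a toy predicate tree. This is an illustration under stated assumptions, not the PR's code: `Estimate`, `Pred`, and `estimate` are hypothetical names, equality selectivity is taken as 1/NDV (the classic Selinger-style rule), and AND/OR assume independent children.

```rust
/// Result of estimating a predicate (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Estimate {
    Computed(f64),
    Delegate,
}

/// Toy predicate tree.
enum Pred {
    /// `col = literal`, carrying the column's NDV if it is known.
    Eq { ndv: Option<u64> },
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
    Not(Box<Pred>),
}

/// Estimates selectivity, returning `Delegate` whenever any leaf lacks NDV.
fn estimate(p: &Pred) -> Estimate {
    use Estimate::{Computed, Delegate};
    match p {
        // Leaf with usable NDV: assume uniform distribution, 1/NDV.
        Pred::Eq { ndv: Some(n) } if *n > 0 => Computed(1.0 / *n as f64),
        // Leaf without NDV: no information advantage, so delegate.
        Pred::Eq { .. } => Delegate,
        // AND: product of child selectivities, only if both are known.
        Pred::And(l, r) => match (estimate(l), estimate(r)) {
            (Computed(a), Computed(b)) => Computed(a * b),
            _ => Delegate,
        },
        // OR: inclusion-exclusion, only if both children are known.
        Pred::Or(l, r) => match (estimate(l), estimate(r)) {
            (Computed(a), Computed(b)) => Computed(a + b - a * b),
            _ => Delegate,
        },
        // NOT: complement, or delegate if the child is unknown.
        Pred::Not(c) => match estimate(c) {
            Computed(s) => Computed(1.0 - s),
            Delegate => Delegate,
        },
    }
}
```

The design point is that a single unknown leaf makes the whole compound predicate delegate, so the caller's existing fallback applies instead of an estimate built partly from guesses.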
… OR selectivity

Add a reusable StatisticsTable (TableProvider + ExecutionPlan with user-supplied
statistics) to the sqllogictest harness, and use it in expression_analyzer.slt
Contributor

@kosiew kosiew left a comment


This looks good to me.
Looking forward to approving this after merge conflict is resolved.

@asolimando asolimando force-pushed the asolimando/ndv-expression-analyzer branch from b79d090 to f2708a7 Compare April 20, 2026 13:29
@asolimando
Member Author

> This looks good to me. Looking forward to approving this after merge conflict is resolved.

Thanks @kosiew, I have resolved the conflicts and force-pushed again, mechanical fixes only.

Rich-T-kid pushed a commit to Rich-T-kid/datafusion that referenced this pull request Apr 21, 2026
…propagation (apache#21483)

Member

@xudong963 xudong963 left a comment


Thanks @asolimando!

The algorithmic contribution is genuinely useful, and thanks for your patience in keeping at it.

But the registry-injection plumbing adds three permanent ExecutionPlan trait methods and five struct fields to work around a missing parameter that #20184 is poised to add properly.

Before landing, I'd want alignment on whether to (a) wait for #20184 and land this on top of a StatisticsContext parameter, or (b) accept the injection design as a permanent API surface

@xudong963 xudong963 requested review from adriangb and alamb April 21, 2026 06:59
@asolimando
Member Author

> Thanks @asolimando!
>
> The algorithmic contribution is genuinely useful, and thanks for your patience in keeping at it.
>
> But the registry-injection plumbing adds three permanent ExecutionPlan trait methods and five struct fields to work around a missing parameter that #20184 is poised to add properly.
>
> Before landing, I'd want alignment on whether to (a) wait for #20184 and land this on top of a StatisticsContext parameter, or (b) accept the injection design as a permanent API surface

Thanks @xudong963 for your thoughtful feedback. The injection design was meant as a stepping stone, not a permanent API surface, and I totally agree that adding the StatisticsContext parameter is the right long-term solution, since it gives us the freedom to provide an even richer context in the future. I wanted to mark the new trait methods as "experimental", but I couldn't find a proper mechanism for that in DataFusion.

Re. (b), I was under the impression that #20184 might land pretty soon, so I was basically counting on rebasing onto it before finalizing, or shortly after, since breaking changes within the same unreleased version are generally tolerated. If the timeline for merging #20184 is uncertain, though, (b) might not be ideal.

The downside of (a) is the risk of conflicts that require force-pushing, which makes it harder for reviewers to check the changes incrementally.

In case it makes your decision easier, I can commit to helping with #20184's implementation, under both scenarios, if the current assignee is busy.

WDYT?

@xudong963
Member

xudong963 commented Apr 22, 2026

> In case it makes your decision easier, I can commit to helping with #20184's implementation, under both scenarios, if the current assignee is busy.

Yes, please. If the current assignee is busy, taking that on is probably the fastest path to unblocking this PR in its final form. Happy to review on that side too. Also cc @jonathanc-n

I also want to hear some suggestions from @alamb about the next step!

@asolimando
Member Author

> > In case it makes your decision easier, I can commit to helping with #20184's implementation, under both scenarios, if the current assignee is busy.
>
> Yes, please. If the current assignee is busy, taking that on is probably the fastest path to unblocking this PR in its final form. Happy to review on that side too. Also cc @jonathanc-n
>
> I also want to hear some suggestions from @alamb about the next step!

Hey @xudong963, I have opened #21815 to close #20184 as requested (I will ping you there too)


Labels

  • common: Related to common crate
  • core: Core DataFusion crate
  • documentation: Improvements or additions to documentation
  • optimizer: Optimizer rules
  • physical-expr: Changes to the physical-expr crates
  • physical-plan: Changes to the physical-plan crate
  • sqllogictest: SQL Logic Tests (.slt)


4 participants